The data used as example in this vignette was generated in the work of Liang et al.5 in the liverwort Marchantia polymorpha.

Input data

The RNAseqEasy package is thought to be used from Salmon mapped and quantified data. The first required input is the path [1] of the directory where quant.sf samples are saved will be the input. The other input necessary for the analysis is a data frame [2] that includes the name of the samples and the information describing them (i.e. variables or conditions defining the experiment).

Salmon data

Salmon4 is a widely used software for quantifying transcript abundance. A well detailed tutorial can be found in their official webpage. It is a very fast tool for transcript quantification directly from fastq.gzfiles. The only requirement is the creation of an index of the transcriptome of the organism we are working with (do not use the genome!!!), also explained in their tutorial.

As a result, we will obtain a folder [1] containing one folder per analyzed sample with its names. In each folder, the main output fail will be named quant.sf, which contains the name of each transcript and their abundance in Transcripts Per Million (TPM), among other information (length, effective length and number of reads). The path to this output folder will be input for our analysis [1].

# URL from Zenodo
data_url <- "https://zenodo.org/records/15800134/files/GSE275561.zip?download=1"

# Temporal directory to download data
output_dir <- file.path(tempdir(), "GSE275561")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

zip_file <- file.path(tempdir(), "GSE275561.zip")

# Download zip file with Salmon quantified data from Zenodo
download.file(url = data_url, destfile = zip_file, mode = "wb")

# Unzip compressed file
unzip(zip_file, exdir = output_dir)

list.files(output_dir)
## [1] "__MACOSX"              "02_Salmon"             "Sample_Data_Wendy.txt"

So, we set the “02_Salmon” folder as sampleDir, which will direct the analysis to the directory to retrieve quantified data for each sample.

sampleDir <- file.path(output_dir, "02_Salmon")

Sample table

For each experiment analysis, it is required to generate a data frame [2] (we call it sample_table) which correlates the name of each sequenced sample with the variables describing it in the experiment. So, it will include a Sample column including sample names, and an extra column for each variable of factor that were included in the experiment design. In our example, we include 12 different samples (Wendy1 to Wendy12), and there were 3 different variables describing them:

  • Genotype. Samples were divided in two different genotypes: ‘Tak1’ (wild-type organisms) and ‘gh3a’ (CRISPR-Cas9 mutants).
  • Treatment. Two different treatments were applied to samples: ‘Mock’ or ‘OPDA’ (exposure to a high concentration of a plant stress hormone from jasmonic acid family).
  • Replicate. Each category of samples included 3 independent biological replicates.
# Data frame with 'Sample', 'Genotype', 'Treatment and 'Replicate' information

sample_table <- data.frame(
  Sample = paste0("Wendy", seq(1,12)),
  Genotype = rep(c("Tak1", "gh3a"), each = 6),
  Treatment = rep(c("Mock", "OPDA"), each = 3),
  Replicate = seq(1,3)
)

sample_table
##     Sample Genotype Treatment Replicate
## 1   Wendy1     Tak1      Mock         1
## 2   Wendy2     Tak1      Mock         2
## 3   Wendy3     Tak1      Mock         3
## 4   Wendy4     Tak1      OPDA         1
## 5   Wendy5     Tak1      OPDA         2
## 6   Wendy6     Tak1      OPDA         3
## 7   Wendy7     gh3a      Mock         1
## 8   Wendy8     gh3a      Mock         2
## 9   Wendy9     gh3a      Mock         3
## 10 Wendy10     gh3a      OPDA         1
## 11 Wendy11     gh3a      OPDA         2
## 12 Wendy12     gh3a      OPDA         3

In differential expression analysis, comparisons are performed against a reference level. By default, R set levels based on alphabetical order for each variable. We can set the specific comparison we want to perform in each analysis, but it is a good practice to set levels order as we want to before we continue. We transform each variable column into factor and establish the level order. In this example, “TaK-1” is the reference genotype, and “Mock” is the reference treatment.

# Convert 'Genotype' into factor, setting "Tak1" as reference
sample_table$Genotype <- factor(sample_table$Genotype,
                                levels = c("Tak1", "gh3a"))

# Convert 'Treatment' into factor, setting "Mock" as reference
sample_table$Treatment <- factor(sample_table$Treatment, levels = c("Mock", "OPDA"))

str(sample_table)
## 'data.frame':    12 obs. of  4 variables:
##  $ Sample   : chr  "Wendy1" "Wendy2" "Wendy3" "Wendy4" ...
##  $ Genotype : Factor w/ 2 levels "Tak1","gh3a": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Treatment: Factor w/ 2 levels "Mock","OPDA": 1 1 1 2 2 2 1 1 1 2 ...
##  $ Replicate: int  1 2 3 1 2 3 1 2 3 1 ...

Functional annotation

Everything else required for the analysis is not specific of the experiment you are performing, but depends on the organism you are working with. There are two different files that we will need to import for the analysis:

  • A data frame that correlates each transcript name with its corresponding gene name.
  • A file including functional annotation, i.e., correlating each gene/transcript to GO terms.

In this example, we are working with samples extracted from the liverwort Marchantia polymorpha, a model plant widely use to study how ancient or conserved biological processes are in plant evolution. To get the reference transcriptome, we download it from marchantia.info, the Genome Database for Marchantia polymorpha. The get.fasta.name() function from phylotools6 package allows us to obtain the transcript names from the fasta file. In Marchantia, gene names and transcript share the same nomenclature, only adding a dot and a number to the different transcripts of a gene. To correlate them, we create a data frame including a column called Name containing transcript names, and based on that we create a new column called GENEID removing last 2 characters (using str_sub() function from string package).

library(phylotools)
library(tidyverse)
URL <- gzcon(url(paste("https://marchantia.info/download/MpTak_v6.1/", "MpTak_v6.1r1.mrna.fasta.gz", sep="")))
txt <- readLines(URL)
Marchantia7_Transcripts <- get.fasta.name(textConnection(txt), clean_name = FALSE)
head(Marchantia7_Transcripts)
## [1] "Mp1g00070.1" "Mp1g00080.1" "Mp1g00090.1" "Mp1g00090.2" "Mp1g00090.3"
## [6] "Mp1g00090.4"
Marchantia7_tx2gene <- data.frame(Name = Marchantia7_Transcripts) %>%
  tidyr::separate(Name, sep = " ", into = c("TXNAME", "CDS")) %>%
  dplyr::mutate(GENEID = stringr::str_sub(TXNAME, 1, -3)) %>%
  dplyr::select(-CDS)

head(Marchantia7_tx2gene, 10)
##         TXNAME    GENEID
## 1  Mp1g00070.1 Mp1g00070
## 2  Mp1g00080.1 Mp1g00080
## 3  Mp1g00090.1 Mp1g00090
## 4  Mp1g00090.2 Mp1g00090
## 5  Mp1g00090.3 Mp1g00090
## 6  Mp1g00090.4 Mp1g00090
## 7  Mp1g00090.5 Mp1g00090
## 8  Mp1g00090.6 Mp1g00090
## 9  Mp1g00100.1 Mp1g00100
## 10 Mp1g00110.1 Mp1g00110

For this example, functional annotation of Gene Ontologies (GOs) for Marchantia genes can be downloaded from the package in a csv file. Many organisms reference databases include this kind of files. If you cannot find them, there are tools and tutorial to annotate your reference genome.

library(RNAseqEasy)

GO_data <- system.file("extdata",
                      "Mpo6.1_GO_db_GeneGO.csv",
                      package="RNAseqEasy", mustWork=TRUE)
Mpo_GO_GOSLIM <- read.csv(GO_data, sep = "\t")
head(Mpo_GO_GOSLIM)
##   Transcript         GO
## 1  Mp1g00060 GO:0003676
## 2  Mp1g00060 GO:0032774
## 3  Mp1g00060 GO:0034062
## 4  Mp1g00060 GO:0042644
## 5  Mp1g00060 GO:0046914
## 6  Mp1g00070 GO:0000075

This annotation file is a two column data frame correlating transcript names and GO terms (one assotiation per row). For topGO GO enrichement analyses, we will need to convert it into a list containing transcript as keys and all their annotated GO terms as values. The load_topGO_db() function included in this package helps us in this purpose.

geneID2GO <- load_topGO_db(Mpo_GO_GOSLIM, "Transcript", "GO")
head(geneID2GO)
## $Mp1g00060
## [1] "GO:0003676" "GO:0032774" "GO:0034062" "GO:0042644" "GO:0046914"
## 
## $Mp1g00070
##  [1] "GO:0000075" "GO:0000123" "GO:0000725" "GO:0000790" "GO:0000800"
##  [6] "GO:0000976" "GO:0001894" "GO:0003682" "GO:0003700" "GO:0004842"
## [11] "GO:0005704" "GO:0005759" "GO:0005813" "GO:0005829" "GO:0005886"
## [16] "GO:0006302" "GO:0007623" "GO:0008023" "GO:0008157" "GO:0008270"
## [21] "GO:0009653" "GO:0010074" "GO:0010332" "GO:0016458" "GO:0016607"
## [26] "GO:0031057" "GO:0031062" "GO:0031436" "GO:0032409" "GO:0032786"
## [31] "GO:0034243" "GO:0035065" "GO:0036464" "GO:0040029" "GO:0042127"
## [36] "GO:0042800" "GO:0042803" "GO:0042981" "GO:0043970" "GO:0044030"
## [41] "GO:0044093" "GO:0044648" "GO:0044666" "GO:0045322" "GO:0045944"
## [46] "GO:0048872" "GO:0051050" "GO:0051053" "GO:0051569" "GO:0065003"
## [51] "GO:0070531" "GO:0070577" "GO:0071339" "GO:0071479" "GO:0080182"
## [56] "GO:0099402" "GO:1902750" "GO:1903507" "GO:1990904" "GO:2000113"
## [61] "GO:2001025" "GO:2001038"
## 
## $Mp1g00080
## [1] "GO:0005794" "GO:0009570" "GO:0009941"
## 
## $Mp1g00090
## [1] "GO:0005773" "GO:0005829" "GO:0012505" "GO:0031410" "GO:0034272"
## [6] "GO:0098588" "GO:0098805"
## 
## $Mp1g00100
##  [1] "GO:0000137" "GO:0000138" "GO:0005634" "GO:0005768" "GO:0005797"
##  [6] "GO:0005802" "GO:0007155" "GO:0008194" "GO:0009507" "GO:0009832"
## [11] "GO:0010214" "GO:0010395" "GO:0031224" "GO:0045489" "GO:0070592"
## [16] "GO:0080157"
## 
## $Mp1g00110
##  [1] "GO:0000491" "GO:0001094" "GO:0005634" "GO:0005737" "GO:0042802"
##  [6] "GO:0051117" "GO:0070062" "GO:0070761" "GO:0097159" "GO:1901363"

To perform GO semantic clustering, the analysis is based on rrvgo7 and GOSemSim8 packages. The available species for this analysis are included in Bioconductor webpage. The only plant organism included is Arabidopsis thaliana, so we will use its code (org.At.tair.db) for this analysis. We can save this data in an object, so it will save computing time in the following analysis. There are three classes of GO terms: Biological Processes (BP), Molecular Functions (MF) and Cell Components (CC). In this example, we will only be focused on biological processes.

save <- GOSemSim::godata("org.At.tair.db", ont="BP")

Differential gene expression analysis

At this point, we already have everything we need to perform our RNA-seq analysis: a directory with our quantified samples (sampleDir), a data frame with the experiment metadata (sample_table), a data frame correlating gene names and transcript names for our organism (Marchantia7_tx2gene), an annotation list correlating transcript names and their annotated GO terms (geneID2GO) and the model organism reference data for the semantic clustering (save).

We want to obtain differential expressed genes (DEGs) in Tak-1 genotype in OPDA treatment compared to Tak-1 genotype in Mock conditions. To organize our results, we will create a folder to save all differential expression analyses and we will call it “DESeq2”, and inside we will create another folder to save specifically this comparison, and we will call it “Tak1_OPDA”.

dir.create(file.path(output_dir, "DESeq2"))
dir.create(file.path(output_dir, "DESeq2", "Tak1_OPDA"))
output_Dir_DESeq2 <- file.path(output_dir, "DESeq2", "Tak1_OPDA")

To run the analysis, we should install DESeq2 and topGO packages from Bioconductor and load them in our R session.

Now, we are ready to run our differential expression analysis. For that, we will use DESeq2_simple function. We set all required parameters that we previously defined (output_path, sampleDir, sample_table, tx2gene, geneID2GO, orgdb and semdata). But, still, there are some parameters that we have to define for this analysis:

For more information about contrasts, read the DESeq2 vignette and check this tutorial.

library(DESeq2)
library(topGO)

DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    tx2gene = Marchantia7_tx2gene,
    geneID2GO = geneID2GO,
    orgdb = "org.At.tair.db",
    semdata = save,
    Name = "Tak1_OPDA",
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment"
  )
## [1] These are the DESeq2 contrasts, that you have to use like list(c("ContrastA", "ContrastB")) : 
## [1] "Intercept"                  "Genotype_gh3a_vs_Tak1"     
## [3] "Treatment_OPDA_vs_Mock"     "Genotypegh3a.TreatmentOPDA"
## [1] These are the costumized contrasts, that you have to use like ContrastA - ContrastB : 
## [1] "Tak1"      "gh3a"      "Mock"      "OPDA"      "Tak1_Mock" "Tak1_OPDA"
## [7] "gh3a_Mock" "gh3a_OPDA"
## Error in DESeq2_simple(output_path = output_Dir_DESeq2, sampleDir = sampleDir, : Choose one of the previous contrasts

After running DESeq2_simple with no Contrast parameter, there is an error telling us we have to choose one of the previous contrasts printed in the console. In this example, we want to compare Tak-1 genotype in OPDA treatment samples vs. Tak-1 genotype in Mock conditions. There are two different ways we can set this contrast:

Tak1_OPDA_result <- DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Tak1_OPDA",
    Contrast = Tak1_OPDA - Tak1_Mock,
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

If we want to perform other possible analyses, these would be the contrast we would have to use:

dir.create(file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction"))
output_Dir_interaction <- file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction")
Genotype_Treatment_result <- DESeq2_simple(
    output_path = output_Dir_interaction,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Genotype_Treatment_interaction",
    Contrast = list(c("Genotypegh3a.TreatmentOPDA")),
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

Sample selection

By default, all samples in sampleDir will be included for the differential expression analysis using DESeq2_simple function. However, it could be interesting to perform an analysis including only a subset of these samples. For example, let’s think about the scenario where we want to study DEGs by OPDA in Tak-1 genotype ignoring the existence of gh3a in the experiment. Thus, we would only want to include samples belonging to Tak-1 genotype. There are two possible ways of doing that in DESeq2_simple function: by using Include or Exclude parameters.

  • Include: we define features in a vector that all samples must present to be included in the analysis. In this example, it would be Include = c("Tak1").
  • Exclude: we define features in a vector that we want to remove from the analysis. All samples containing any of these features will be removed. In this example, it would be Exclude = c("gh3a").

In our example, both parameters would work the same way, i.e., only including Tak-1 samples in the analysis, both in OPDA and Mock treatments.

PCA

Apart from differential expression analysis, all samples included in the DESeq2_simple function will be compute for principal component analysis and we will obtain a png file of PC1 vs PC2 using 500 top genes by default. If we do not want to generate the PCA, we can set the PCA parameter to FALSE withing the DESeq2_simple function. If we want to perform a different principal component analysis (plotting other component, using a different number of top genes, changing colors, etc…) we can do it by running the run_pca_analysis function.

Output

After running DESeq2_simple function, there are some files generated with results from the differential gene expression analysis:

  • Two tab-delimited text files are generated including results from the differential expression analysis of all genes (Name.txt) and only significantly expressed genes, i.e., \(padj \le 0.05\) (Name_Sig.txt).
##            baseMean log2FoldChange     lfcSE       stat       pvalue
## Mp1g00070  147.7289    -0.87073377 0.2215426 -3.9303223 8.483207e-05
## Mp1g00080 8077.3196    -0.54069976 0.1329801 -4.0660199 4.782285e-05
## Mp1g00090  531.2465    -0.05551592 0.1356234 -0.4093389 6.822909e-01
## Mp1g00100 1127.5365     0.87159546 0.1302033  6.6941140 2.169823e-11
## Mp1g00110  419.9734    -0.10280589 0.1763512 -0.5829610 5.599196e-01
## Mp1g00120  511.5859    -0.35720864 0.1180411 -3.0261372 2.476999e-03
##                   padj
## Mp1g00070 3.932574e-04
## Mp1g00080 2.335654e-04
## Mp1g00090 7.772935e-01
## Mp1g00100 2.389950e-10
## Mp1g00110 6.791640e-01
## Mp1g00120 7.944981e-03
  • A PDF file including a heatmap of \(log2(n + 1)\) normalized values of all samples included in the analysis (Name_heatmap.pdf).
## Converting page 1 to Tak1_OPDA_heatmap_1.png... done!
Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

  • A png file including the PCA of all samples included in the analysis, only if PCA parameter is set as the TRUE default value (PCA_Name.png).
PCA plot of PC1 vs PC2, using 500 top genes

PCA plot of PC1 vs PC2, using 500 top genes

  • A “topGO” folder, including results of GO enrichment analysis for Up, Down and Up+Down DEGs (\(padj \le 0.05\), and \(log2(FC) \ge 1\) for Up-regulated and \(log2(FC) \le -1\) for Down-regulated genes):
    • Two tab-delimited text files including results from the GO enrichment analysis. One include detailed results (topGO_Name_All/Up/Down_results.txt) and the other only enriched GO terms and their p-values (topGO_Name_All/Up/Down_pvals.txt).
##            Annotated Significant Expected
## GO:0044238      6226         509   528.04
## GO:0045893       688          59    58.35
## GO:0048574        90           9     7.63
## GO:0010154      1454         150   123.32
## GO:0044706       462          48    39.18
## GO:0044281      1878         198   159.28
##                                                          Term         pval
## GO:0044238                          primary metabolic process 0.0061337475
## GO:0045893 positive regulation of DNA-templated transcription 0.0060918982
## GO:0048574                 long-day photoperiodism, flowering 0.0049341244
## GO:0010154                                  fruit development 0.0002647838
## GO:0044706               multi-multicellular organism process 0.0321583675
## GO:0044281                   small molecule metabolic process 0.0091339926
##            Enrichment
## GO:0044238   1.006982
## GO:0045893   1.056274
## GO:0048574   1.231723
## GO:0010154   1.270691
## GO:0044706   1.279712
## GO:0044281   1.298621
  • Three PDF files for GO enrichment visualization:
    • A tree map plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Treemap.pdf).

    • A scatter plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Scatterplot.pdf).

    • A bubble plot of all enriched GO terms in descendant order according to their enrichment (topGO_Name_All/Up/Down_bubble.pdf).

Tree map of Gene Ontologies enriched in Up-regulated genes

Tree map of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Clustering of co-expressed genes

Another application of the package is the generation of co-expressed gene modules by the implementation of components from the WGCNA3 package. This functionality is included in the WGCNA_Modules function. As DESeq2_simple function, it requires to specify the output_path, sampleDir, sample_table, Variables, tx2gene, Name, geneID2GO and Annotation parameters. The specific parameters that have also to be included in this function are:

Most of the advanced parameters in for the blockwiseModules function of the WGCNA package for the module generation are set to default.

As an example, we will use the output DEGs from the Genotype:Treatment interaction in DESeq2_simple, i.e., genes that significantly respond to the OPDA treatment in the gh3a genotype in a different way than the Tak-1 genotype. This different response can be in different ways, so we will use the gene co-expression modules to explore this responses.

DESeq2_simple_output <- sigtable <- read.table(file.path(output_Dir_interaction, "Genotype_Treatment_interaction.txt"), header = TRUE)

DEGs <- rownames(DESeq2_simple_output)

Now that we already have selected our input gene set, we can run the WGCNA_Modules function:

library(WGCNA)
dir.create(file.path(output_dir, "WGCNA"))
output_WGCNA <- file.path(output_dir, "WGCNA")
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 3269.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 3269 of 13684
## Warning: executing %dopar% sequentially: no parallel backend registered
##    ..working on genes 3270 through 6538 of 13684
##    ..working on genes 6539 through 9807 of 13684
##    ..working on genes 9808 through 13076 of 13684
##    ..working on genes 13077 through 13684 of 13684
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.6290  1.600          0.752  5500.0   5620.00   7910
## 2      2   0.0573  0.112          0.132  3090.0   2970.00   5730
## 3      3   0.7080 -0.360          0.665  2010.0   1760.00   4530
## 4      4   0.9310 -0.576          0.913  1420.0   1110.00   3760
## 5      5   0.9680 -0.702          0.959  1060.0    738.00   3210
## 6      6   0.9720 -0.783          0.965   823.0    505.00   2800
## 7      7   0.9740 -0.840          0.967   658.0    357.00   2480
## 8      8   0.9740 -0.877          0.968   538.0    260.00   2220
## 9      9   0.9760 -0.908          0.971   448.0    193.00   2010
## 10    10   0.9740 -0.937          0.969   378.0    146.00   1830
## 11    12   0.9750 -0.976          0.975   280.0     87.00   1550
## 12    14   0.9780 -1.000          0.980   215.0     54.20   1340
## 13    16   0.9770 -1.030          0.980   170.0     36.00   1170
## 14    18   0.9720 -1.050          0.977   137.0     24.80   1040
## 15    20   0.9720 -1.070          0.979   113.0     17.50    931
## 16    22   0.9710 -1.080          0.979    94.2     12.70    841
## 17    24   0.9730 -1.090          0.981    79.7      9.42    764
## 18    26   0.9740 -1.100          0.982    68.2      7.13    699
## 19    28   0.9730 -1.110          0.982    59.0      5.47    642
## 20    30   0.9710 -1.120          0.980    51.4      4.22    593
## 21    32   0.9720 -1.120          0.981    45.2      3.32    549
## 22    34   0.9740 -1.130          0.983    39.9      2.61    511
## 23    36   0.9740 -1.130          0.983    35.5      2.09    477
## 24    38   0.9710 -1.140          0.980    31.8      1.70    446
## 25    40   0.9720 -1.140          0.981    28.6      1.39    419
## Error in WGCNA_Modules(output_path = output_WGCNA, sampleDir = sampleDir, : Choose a power based on softthresholdingpower.pdf

We did not include the Power parameter, so the function stopped at warned us to select it based on the “softthresholdingpower.pdf” file.

R^2 values as a function of the soft thresholds

R^2 values as a function of the soft thresholds

Based on this plot, we will select 22 as the soft thresholding power since it is the first value that rises the threshold.

COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   Power = 22,
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 3269.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 3269 of 13684
##    ..working on genes 3270 through 6538 of 13684
##    ..working on genes 6539 through 9807 of 13684
##    ..working on genes 9808 through 13076 of 13684
##    ..working on genes 13077 through 13684 of 13684
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.6290  1.600          0.752  5500.0   5620.00   7910
## 2      2   0.0573  0.112          0.132  3090.0   2970.00   5730
## 3      3   0.7080 -0.360          0.665  2010.0   1760.00   4530
## 4      4   0.9310 -0.576          0.913  1420.0   1110.00   3760
## 5      5   0.9680 -0.702          0.959  1060.0    738.00   3210
## 6      6   0.9720 -0.783          0.965   823.0    505.00   2800
## 7      7   0.9740 -0.840          0.967   658.0    357.00   2480
## 8      8   0.9740 -0.877          0.968   538.0    260.00   2220
## 9      9   0.9760 -0.908          0.971   448.0    193.00   2010
## 10    10   0.9740 -0.937          0.969   378.0    146.00   1830
## 11    12   0.9750 -0.976          0.975   280.0     87.00   1550
## 12    14   0.9780 -1.000          0.980   215.0     54.20   1340
## 13    16   0.9770 -1.030          0.980   170.0     36.00   1170
## 14    18   0.9720 -1.050          0.977   137.0     24.80   1040
## 15    20   0.9720 -1.070          0.979   113.0     17.50    931
## 16    22   0.9710 -1.080          0.979    94.2     12.70    841
## 17    24   0.9730 -1.090          0.981    79.7      9.42    764
## 18    26   0.9740 -1.100          0.982    68.2      7.13    699
## 19    28   0.9730 -1.110          0.982    59.0      5.47    642
## 20    30   0.9710 -1.120          0.980    51.4      4.22    593
## 21    32   0.9720 -1.120          0.981    45.2      3.32    549
## 22    34   0.9740 -1.130          0.983    39.9      2.61    511
## 23    36   0.9740 -1.130          0.983    35.5      2.09    477
## 24    38   0.9710 -1.140          0.980    31.8      1.70    446
## 25    40   0.9720 -1.140          0.981    28.6      1.39    419
##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ....pre-clustering genes to determine blocks..
##    Projective K-means:
##    ..k-means clustering..
##    ..merging smaller clusters...
## Block sizes:
## gBlocks
##    1    2    3 
## 4997 4543 4144 
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file Salva-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##  ..Working on block 2 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 2 into file Salva-block.2.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##  ..Working on block 3 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 3 into file Salva-block.3.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.25
##        Calculating new MEs...
## Warning in RColorBrewer::brewer.pal(length(unique_groups), "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

There are 10 different clusters that have been generated after the analysis. The average behavior of genes from each category are summarized in the “Summary_Modules_Name.pdf” file:

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

1.
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. 15, 550 (2014).
2.
Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment analysis for gene ontology. (2021).
3.
Langfelder, P. & Horvath, S. WGCNA: An r package for weighted correlation network analysis. 559 (2008).
4.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
5.
6.
7.
Sayols, S. Rrvgo: A bioconductor package for interpreting lists of gene ontology terms. (2023) doi:10.17912/MICROPUB.BIOLOGY.000811.
8.
Yu, G. Gene Ontology Semantic Similarity Analysis Using GOSemSim. in 207–215 (Springer US, 2020). doi:10.1007/978-1-0716-0301-7_11.